Mastering Apache Spark 2.x by Romeo Kienzler

Author: Romeo Kienzler
Language: eng
Format: mobi
Tags: COM062000 - COMPUTERS / Data Modeling and Design, COM018000 - COMPUTERS / Data Processing, COM051280 - COMPUTERS / Programming Languages / Java
Publisher: Packt
Published: 2018-07-18T10:35:15+00:00


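import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object bayes1 extends App {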

The HDFS data file is again defined, and a Spark context is created as before:

  val hdfsServer = "hdfs://localhost:8020"
  val hdfsPath = "/data/spark/nbayes/"
  val dataFile = hdfsServer + hdfsPath + "DigitalBreathTestData2013-MALE2a.csv"

  val sparkMaster = "spark://localhost:7077"
  val appName = "Naive Bayes 1"

  val conf = new SparkConf()
  conf.setMaster(sparkMaster)
  conf.setAppName(appName)

  val sparkCxt = new SparkContext(conf)

The raw CSV data is loaded and each line is split on commas. The first column becomes the label (Male/Female) that the data will be classified on. The second column, a space-separated list of values, becomes the vector of classification features:

  val csvData = sparkCxt.textFile(dataFile)

  val ArrayData = csvData.map { csvLine =>
    // Column 0 is the label; column 1 holds the space-separated features.
    val colData = csvLine.split(',')
    LabeledPoint(
      colData(0).toDouble,
      Vectors.dense(colData(1).split(' ').map(_.toDouble))
    )
  }
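The parsing step can be tried out without a cluster. The following is a minimal sketch that applies the same split logic to a single line using plain Scala collections; `ParseSketch`, `parseLine`, and the sample input are hypothetical and are not part of the book's listing or of MLlib:

```scala
// Hypothetical helper: parses one CSV line into (label, features)
// using plain Scala collections instead of MLlib's LabeledPoint.
object ParseSketch {
  def parseLine(csvLine: String): (Double, Vector[Double]) = {
    val colData = csvLine.split(',')
    // Column 0 is the numeric class label.
    val label = colData(0).toDouble
    // Column 1 holds the features as a space-separated list.
    val features = colData(1).trim
      .split(' ')
      .filter(_.nonEmpty)
      .map(_.toDouble)
      .toVector
    (label, features)
  }
}
```

For a made-up line such as `"1,2 3 4"`, `ParseSketch.parseLine` returns a label of `1.0` and the features `Vector(2.0, 3.0, 4.0)`.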

The data is then randomly divided into training (70%) and testing (30%) datasets:

  val divData = ArrayData.randomSplit(Array(0.7, 0.3), seed = 13L)
  val trainDataSet = divData(0)
  val testDataSet = divData(1)
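A weighted random split of this kind assigns each record by random sampling, so the 70/30 proportions should be expected to hold only approximately. The sketch below illustrates the idea on a local collection; `SplitSketch` and its parameters are hypothetical names, and the code is an illustration of the principle rather than Spark's implementation:

```scala
import scala.util.Random

// Sketch only: a seeded, per-element random split on a local Seq.
// Each element is assigned independently, so split sizes are approximate.
object SplitSketch {
  def randomSplit[A](data: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    val rng = new Random(seed)
    // One uniform draw per element; draws below the threshold go to training.
    data.partition(_ => rng.nextDouble() < trainFraction)
  }
}
```

Calling `SplitSketch.randomSplit((1 to 1000).toVector, 0.7, 13L)` yields two disjoint collections that together contain all 1,000 elements. Fixing the seed makes the split reproducible between runs.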

The MLlib Naive Bayes classifier can now be trained on the training set. The trained model, held in the nbTrained variable, can then be used to predict the Male/Female labels for the testing data:

  val nbTrained = NaiveBayes.train(trainDataSet)
  val nbPredict = nbTrained.predict(testDataSet.map(_.features))
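To make the training and prediction steps less of a black box, here is a minimal sketch of a multinomial Naive Bayes classifier with additive smoothing, written against local collections. It assumes non-negative, count-like features and a smoothing parameter `lambda` of 1.0. All names (`NaiveBayesSketch`, `Model`, `train`, `predict`) are hypothetical, and this is not the MLlib implementation:

```scala
// Sketch of multinomial Naive Bayes with additive smoothing.
// Hypothetical names throughout; this is not MLlib code.
object NaiveBayesSketch {
  final case class Model(
    labels: Vector[Double],
    logPrior: Vector[Double],              // log P(class)
    logLikelihood: Vector[Vector[Double]]  // log P(feature | class), per class
  ) {
    // Return the label whose log posterior score is highest.
    def predict(features: Vector[Double]): Double = {
      val scores = labels.indices.map { c =>
        logPrior(c) + features.zip(logLikelihood(c)).map { case (x, w) => x * w }.sum
      }
      labels(scores.indexOf(scores.max))
    }
  }

  def train(data: Seq[(Double, Vector[Double])], lambda: Double = 1.0): Model = {
    val numFeatures = data.head._2.length
    val byLabel = data.groupBy(_._1).toVector.sortBy(_._1)
    val labels = byLabel.map(_._1)
    // Smoothed class priors.
    val logPrior = byLabel.map { case (_, rows) =>
      math.log((rows.size + lambda) / (data.size + lambda * labels.size))
    }
    // Smoothed per-class feature probabilities.
    val logLikelihood = byLabel.map { case (_, rows) =>
      // Sum each feature over all rows of this class.
      val sums = rows.map(_._2).reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
      val total = sums.sum + lambda * numFeatures
      sums.map(s => math.log((s + lambda) / total))
    }
    Model(labels, logPrior, logLikelihood)
  }
}
```

The model scores each class as the log prior plus the feature-weighted sum of log likelihoods, so the decision boundary between two classes is linear in the features.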

Because all of the data already carries labels, the original and predicted labels for the test data can be compared. The accuracy is the percentage of test records whose predicted label matches the original label:

  val predictionAndLabel = nbPredict.zip(testDataSet.map(_.label))

  val accuracy = 100.0 * predictionAndLabel.filter(x => x._1 == x._2).count() /
    testDataSet.count()

  println("Accuracy : " + accuracy)
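The accuracy calculation itself is simple enough to check by hand. The following sketch computes the same percentage on a local sequence of (prediction, label) pairs; `AccuracySketch` is a hypothetical name and the sample pairs are made up:

```scala
// Sketch: percentage of (prediction, label) pairs that agree.
object AccuracySketch {
  def accuracy(predictionAndLabel: Seq[(Double, Double)]): Double =
    100.0 * predictionAndLabel.count { case (p, l) => p == l } / predictionAndLabel.size
}
```

With the four pairs `(1.0, 1.0)`, `(0.0, 1.0)`, `(0.0, 0.0)`, and `(1.0, 0.0)`, two predictions match, so the result is `50.0`. Multiplying by `100.0` first keeps the division in floating point rather than integer arithmetic.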

}

That completes the Scala Naive Bayes code example. It's now time to run the compiled bayes1 application using spark-submit and determine the classification accuracy. The parameters are the same as before; only the class name has changed:

spark-submit \
  --class bayes1 \
  --master spark://hc2nn.semtech-solutions.co.nz:7077 \
  --executor-memory 700M \
  --total-executor-cores 100 \
  /home/hadoop/spark/nbayes/target/scala-2.10/naive-bayes_2.10-1.0.jar

The resulting accuracy given by the Spark cluster is just 43 percent, which seems to imply that this data is not suitable for Naive Bayes:

Accuracy: 43.30

Luckily, artificial neural networks, a more powerful classifier, are introduced later in the chapter. In the next example, we will use K-Means to try to determine what clusters exist within the data. Remember, Naive Bayes needs the data classes to be linearly separable along the class boundaries. With K-Means, it will be possible to determine both the membership and the centroid location of the clusters within the data.
